Abstract


Six  application  benchmarks,  including  four  numerical  aerodynamic  simulation  (NAS)  codes, 
provided  by  H.  Jin  and  J.  Wu,  were  previously  parallelized  using  OpenMP  and  message-passing 
interface  (MPI)  and  run  on  a  128-processor  Silicon  Graphics  Inc.  (SGI)  Origin  2000.  Detailed 
profile  data  were  collected  to  understand  the  factors  causing  imperfect  scalability.  The  results 
show  that  load  imbalance  and  cost  of  remote  accesses  are  the  main  factors  in  limited  speedup  of 
the  OpenMP  versions,  whereas  communication  costs  are  the  single  major  factor  in  the 
performance  of  the  MPI  versions. 
1.  Introduction


Over  the  last  several  years,  several  portable  mechanisms  for  developing  parallel  programs 
have  been  standardized.  This  set  includes  relatively  low-level  libraries  like  the  message-passing 
interface  (MPI),  parallelization  directives  like  OpenMP,  and  higher  level  languages  including 
high-performance  Fortran  (HPF).  Unlike  the  use  of  vendor-specific  libraries  and  compiler 
directives,  these  libraries  and  language  extensions  are  supported  on  a  large  number  of  systems. 
At  the  same  time,  distributed  shared  memory  (DSM)  systems  are  emerging  as  an  important  class 
of  parallel  machines.  This  includes  both  the  hardware-DSM  systems  like  the  Silicon  Graphics 
Inc.  (SGI)  Origin  2000  and  software-DSM  systems  like  Treadmarks.  The  main  advantage  of 
such  systems  is  that  the  programmers  have  the  option  of  programming  them  using  either  a  shared 
memory  or  message-passing  paradigm  or  both. 

In  this  report,  an  experimental  study  is  presented  to  answer  the  following  question:  What  are 
the  main  obstacles  (among  factors  like  communication  costs,  false  sharing,  and  synchronization 
costs)  in  achieving  scalable  performance  through  each  of  the  paradigms?  Though  answers  have 
been  attempted,  the  issue  remains  contentious  [1].  Six  benchmark  programs  are  used,  including 
four  numerical  aerodynamics  simulation  (NAS)  codes  and  two  irregular  computational  fluid 
dynamics  (CFD)  codes.  The  performance  of  OpenMP  and  MPI  versions  of  these  programs  is 
examined  on  a  128-processor  Origin  2000.  Besides  comparing  the  scalability  of  these  versions, 
hardware  counter-based  performance  data  are  used  to  understand  the  difference  between  the 
performance  of  different  versions  and  the  reasons  for  imperfect  scalability. 

In  section  2,  the  programming  environments  and  benchmarks  used  for  the  experimental  study 
are  explained.  The  results  from  the  experiments  are  presented  and  analyzed  in  section  3,  and  the 
conclusions  are  presented  in  section  4. 
2.  Programming  Environment


2.1 Origin 2000. The Origin 2000 is a DSM architecture. The machine utilized for this
study is part of the U.S. Army Research Laboratory’s (ARL) Major Shared Resource Center
(MSRC) supercomputing assets. The largest configuration available comprises 128 nodes.
Each processor has 1 GB of local memory. Each processor is a MIPS R12000 (R12k) 64-bit
central processing unit (CPU) running at 300 MHz with two 32-kB primary caches and one
8-MB secondary cache. The older R10000 (R10k) 64-bit chips ran at 195 MHz, with two
32-kB primary caches and one 4-MB secondary cache.

An interesting aspect of the Origin 2000 system is its capability for reporting detailed profile
information to the application programmers. The MIPS R12k and the older R10k are two of the
very few processors whose hardware counters are made visible to the end-users of the
machine. A small set of events is monitored by the hardware counters, including cache misses,
memory  coherence  operations,  floating-point  operations,  and  branch  mispredictions.  Because 
this  monitoring  is  done  in  hardware  rather  than  software,  it  is  possible  to  extract  detailed 
information  about  the  state  of  the  system  without  affecting  the  behavior  of  the  program  being 
monitored. 

In this study, profiling data are collected by running the codes with perfex, a profiling tool
that reports a count for the 32 countable event types, with no modifications to the targeted
program and with only a minimal effect on its execution time. The focus was on the subset of event
counts  that  are  indicative  of  specific  performance  inhibitors  to  scalability.  Table  1  shows  the 
performance  inhibitors  that  were  examined  and  the  corresponding  event  counts  that  were  used  to 
evaluate  those  potential  problems. 

Table 1. Using Event Counts for Performance Problem Identification

Performance Problem          Event Count
Load Imbalance               Number of floating-point operations issued per process is not comparable.
Excessive Synchronization    Number of store conditionals is high.
False Sharing                Number of store exclusives to a shared block is high.
Cache Unfriendly             L1 and L2 cache misses are high.

2.2 Parallel Programming Environments. This study concentrated on using OpenMP as
the mechanism for shared-memory programming. The Origin 2000 can also be programmed as a
message-passing machine using MPI, which, like OpenMP, is portable across a number
of platforms. The MIPSpro 7.2.1 compiler was used, and the applications were compiled with f77
using aggressive optimizations (-O3 flag).

2.3 Benchmarks and Problem Sizes. This study of OpenMP is focused on the four NAS
Parallel Benchmarks (NPBs) that are most relevant to the Army’s applications and on two
additional benchmarks, called IRREG and LES. The NPB set was developed by the NAS
Program at NASA Ames Research Center for the performance evaluation of parallel computing
systems [2]. NPBs mimic the computation and data movement characteristics of large-scale
CFD applications. This study focused on three simulated application codes (LU solver [LU],
pentadiagonal solver [SP], and block tridiagonal [BT]) and one kernel (conjugate gradient [CG]).

The assumption is that MPI gives the shortest run time. The NAS-optimized MPI versions of
the four codes tested were therefore the basis for comparison. In this study, MPI
implementations of the benchmarks were obtained from the NPB 2.3 NAS website [3]. The
rationale behind the PBN versions given by the working team is to provide the community with
an optimized version of NPB 2.3-serial and a sample OpenMP implementation. The NPB and PBN
versions specify three problem sizes for the benchmarks. This report focuses on the Class B
problem sizes, as they are the closest to realistic problem sizes, as defined by the
applications commonly run at ARL. Table 2 shows the problem sizes for Class A, Class B, and
Class C for each benchmark. A comparison of processing times between Classes B and C is
given in Figure 1.
Table 2. NPB Problem Sizes (Number of Elements)

Benchmark Code               Class A    Class B    Class C
BT                           64^3       102^3      162^3
LU                           64^3       102^3      162^3
Pentadiagonal Solver (SP)    64^3       102^3      162^3
CG                           12,000     75,000     150,000

Figure 1. Timing for MPI Version of CG for Classes B and C, R12k Chip.

Another benchmark examined is the large eddy simulation (LES) [4]. LES
can  be  used  to  characterize  turbulent  flow,  where  large-length  scales  signify  the  domain  size  and 
small-length  scales  represent  dissipative  eddies.  Although  small  scales  are  modeled  due  to  their 
isotropic  nature,  high-performance  computing  (HPC)  resources  are  required  to  capture  the  large 
energy-carrying  length  scales.  In  this  report,  a  vectorized  simulation  code  is  optimized  and 
parallelized  for  Origin  2000  performance.  A  realistic  simulation  of  flow  past  a  backward-facing 
step  with  a  problem  size  of  32  x  32  x  32  is  used  to  study  scaling  behavior.  Periodic  boundary 
conditions  are  applied  in  the  stream-wise  and  span-wise  directions. 

The second non-NAS benchmark examined is IRREG [5]. IRREG is abstracted from a
CFD application that uses unstructured meshes to model a physical problem. The mesh is
represented by nodes, edges that connect two nodes, and faces that connect three or four nodes.
For the realistic submarine mesh used in these benchmark runs, the numbers of nodes, edges, and
faces were 92,564, 623,003, and 504,947, respectively.


3.  Experimental  Results 


This section compares the performance of the OpenMP and MPI versions of the four NAS
codes and presents the performance of the OpenMP versions of the two irregular CFD codes.


3.1 Comparing OpenMP and MPI Performance. The performance of the OpenMP and MPI
versions of CG, LU, SP, and BT is shown in Figures 2 and 3. The plots show wall-clock time
as  a  function  of  the  number  of  processors.  In  general,  two  observations  can  be  made  from  these 
four  plots. 


Figure  2.  Scalability  of  OpenMP  and  MPI  Versions  of  CG  and  LU. 


Reasonably good scalability is achieved on up to 64 processors for all eight programs
(two versions of each of the four benchmarks). The speedup starts leveling off for configurations
beyond 64 processors, which shows that the problem sizes in the NAS Class B data sets are not
suitable for parallelization on a very large number of processors. For three of the four
applications, the MPI version achieves better performance than the OpenMP version. The MPI
versions are significantly faster for LU and CG, slightly better on large configurations for SP,
and slightly slower on BT.

Figure 3. Scalability of OpenMP and MPI Versions of SP and BT.

Of this set of benchmarks, using the R10k chip, the poorest speedups are achieved for CG.
On 128 processors, the OpenMP version achieves a speedup of 14. A slightly higher speedup of
15 is achieved at the 64-processor configuration. The performance of the MPI version of CG is
significantly better on 16, 32, 64, and 128 processors. On both the 64- and 128-processor
configurations, the MPI version achieves a speedup of 43. For LU, OpenMP scales reasonably
well until 64 processors, achieving a speedup of 32. The MPI version again has significantly
better speedup, achieving a factor of 80 on 128 processors. For BT, OpenMP achieves a speedup
of nearly 50 on 128 processors. The speedup of the MPI version is only 30. It should be noted
that the 1-processor MPI version performs much worse than the OpenMP sequential version of
this code. The results from MPI and OpenMP are the closest in the case of SP. A speedup of
nearly 50 is achieved on 121 processors for both versions (see the note below).
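
The speedups quoted above are easier to compare across processor counts when converted to parallel efficiency, that is, speedup divided by the number of processors. The following minimal Python sketch does this for figures reported in the text; the function name and the dictionary layout are illustrative choices, not part of the original study.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors

# (benchmark, version, processors) -> speedup, as quoted in the text above.
reported = {
    ("CG", "OpenMP", 64): 15,
    ("CG", "OpenMP", 128): 14,
    ("CG", "MPI", 64): 43,
    ("CG", "MPI", 128): 43,
    ("LU", "OpenMP", 64): 32,
    ("LU", "MPI", 128): 80,
}

efficiency = {key: parallel_efficiency(s, key[2]) for key, s in reported.items()}

for (code, version, procs), eff in sorted(efficiency.items()):
    print(f"{code:2s} {version:6s} {procs:3d} processors: {eff:.0%} of ideal")
```

Under this measure, the MPI version of CG drops from about 67% to about 34% of ideal between 64 and 128 processors, which is the leveling off beyond 64 processors noted earlier.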

Note: The SP results are reported for 121 rather than 128 processors because the code
requires a square number of processors.

How profiling data from perfex can be used to determine the reasons for imperfect scalability
and the differences in performance of OpenMP and MPI versions of the programs is now
discussed. For a shared-memory program run on a DSM architecture, four factors usually
contribute to a lower-than-ideal speedup: load imbalance, synchronization costs, false sharing,
and remote accesses. Load imbalance implies that the work in parallelized loops is not evenly
distributed among the processors. Synchronization costs denote the time spent by the processors
in coordinating the progress of the computation among themselves. False sharing occurs when two
or more processors access different variables that happen to be colocated on the same cache
block, with at least one of the accesses being a write. Once the write occurs, the entire cache
line is invalidated in the caches of the other processors. Remote accesses are references to
off-processor data, which are expensive compared to references to local data.
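
Whether two variables can falsely share a cache block depends only on their addresses and the block size. The sketch below is a minimal illustration, assuming 8-byte array elements, an array that starts on a cache-line boundary, and a 128-byte line; the line size is an assumed parameter rather than a figure taken from this report, and the function names are hypothetical.

```python
def cache_line_of(index, elem_size=8, line_size=128):
    """Cache line number holding element `index` of a line-aligned array."""
    return (index * elem_size) // line_size

def may_false_share(i, j, elem_size=8, line_size=128):
    """True if distinct elements i and j lie in the same cache line."""
    same_line = cache_line_of(i, elem_size, line_size) == cache_line_of(j, elem_size, line_size)
    return i != j and same_line

def padded_stride(elem_size=8, line_size=128):
    """Elements to skip so that per-processor slots land in separate lines."""
    return line_size // elem_size

# Per-processor accumulators stored contiguously: neighbours share a line.
print(may_false_share(0, 1))
# Spacing the slots one full line apart removes the collision.
stride = padded_stride()
print(may_false_share(0 * stride, 1 * stride))
```

Padding per-processor data to a full line trades memory for fewer invalidations; whether that is worthwhile depends on how often the data are written.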

For the message-passing versions, the two common causes of imperfect speedup are
communication costs and load imbalance. Since single-program, multiple-data (SPMD) versions
of programs are run and there is no shared-memory support, false sharing and synchronization
costs do not occur. For each of the eight programs for which scalability numbers have been
presented, hardware counter data obtained from perfex were analyzed. For all OpenMP
programs, the event counts and typical times obtained for synchronization and false sharing were
extremely low (less than 3 s), even for the highest number of processors used. In general, a good
level of cache friendliness was seen for all programs except CG. Cache friendliness was
examined by looking at the average L1 and L2 cache hit rates returned by perfex. L2 cache hit
rates were consistently higher than 0.9 for each of the eight programs, and L1 cache hit rates
were also greater than 0.9 for all programs except CG. CG is an irregular code; therefore, poor
L1 locality is achieved. The load imbalance issue was examined by looking at the number of
floating-point operations performed on different processors in each run. A load-balanced
program will have very similar numbers of floating-point operations performed across all
processors. Figure 4 shows these data for the OpenMP and MPI versions of CG. Detailed data
from LU and SP are not presented here, but trends are explained later. Figure 5 shows the
minimum, maximum, and average number of cycles spent on floating-point operations across all
processors for the OpenMP and MPI versions of BT. The increase in range with an increasing
number of processors suggests a problem with load balancing.
Figure 4. Max, Average, and Min MFlops Across All Processors for OpenMP and MPI for CG.

Figure 5. Max, Average, and Min MFlops Across All Processors for OpenMP and MPI for BT.

The difference in the number of floating-point operations between different processors
explains the limited speedup (50 times on 128 processors) of the OpenMP version of BT. The
results are very different for the MPI version of the same code. For the MPI version, the
load is evenly balanced between different processors on all processor configurations.
Interestingly, the OpenMP version gives overall better performance than the MPI version of BT.
A possible explanation for the poor performance of the MPI version is its high communication
costs.

The  performance  of  the  OpenMP  version  can  be  further  improved  by  better  work 
distribution.  The  program  typically  has  nested  loops  where  the  number  of  iterations  across  each 
dimension  is  102  (for  Class  B).  The  loop-level  parallelized  OpenMP  version  achieves 
parallelism  across  only  a  single  dimension,  so  there  is  no  way  of  using  more  than  102  processors. 
Possibly,  by  using  additional  directives  or  by  using  SPMD-style  OpenMP  parallelism,  the 
performance  of  the  OpenMP  version  of  BT  can  be  enhanced. 
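
The limit described above follows from simple counting. The sketch below is a minimal illustration, assuming a static schedule that hands each processor a contiguous share of the 102 iterations as evenly as possible and assuming every iteration costs the same; the function names are illustrative, and actual OpenMP runtimes may split the iterations slightly differently.

```python
def static_chunks(n_iter, n_proc):
    """Iterations per processor for an even, contiguous static split."""
    base, extra = divmod(n_iter, n_proc)
    return [base + 1 if p < extra else base for p in range(n_proc)]

def speedup_bound(n_iter, n_proc):
    """Upper bound on loop speedup: total work over the largest share."""
    return n_iter / max(static_chunks(n_iter, n_proc))

N = 102  # iterations per dimension for Class B, as stated in the text
for procs in (64, 102, 128):
    chunks = static_chunks(N, procs)
    idle = chunks.count(0)
    print(procs, "processors:", idle, "idle, bound", speedup_bound(N, procs))
```

Under these assumptions, 26 of 128 processors receive no iterations. On 64 processors, some processors carry two iterations while others carry one, so the bound is 51 rather than 64.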

Similar  trends  are  seen  from  CG.  Excellent  load  balance  is  demonstrated  by  the  MPI 
version,  leading  to  good  performance.  Load  imbalance  can  be  seen  for  the  OpenMP  version, 
though  it  is  not  as  severe  as  in  the  case  of  BT.  Because  of  the  irregular  accesses  in  this  code,  the 
high  cost  of  frequent  nonlocal  references  is  likely  to  be  another  important  factor  behind  limited 
speedup.  Unfortunately,  perfex  does  not  provide  a  mechanism  for  accurately  measuring  the 
number  of  nonlocal  references.  Also,  remote  references  can  be  aggregated  in  message-passing 
versions,  which  is  not  possible  in  a  shared-memory  version. 

In the case of SP, the OpenMP version achieves good load balance on 100 processors. The
number of floating-point operations performed by each processor only ranges from 21.21 x 10^6/s
to 18.81 x 10^6/s. However, on 121 processors, some of the processors do not get any work, for
reasons similar to those for BT. Good load balance for the OpenMP version explains why the
performance of the OpenMP and MPI versions is very similar.

With LU, significant load imbalance is observed with OpenMP. On 64 processors, the
number of floating-point operations performed by each processor ranges from 29.63 x 10^6/s to
14.03 x 10^6/s. The load imbalance for the OpenMP version explains the difference in the
performance of the OpenMP and MPI versions.
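
The spread between the busiest and the least busy processor can be summarized as a single ratio. The minimal sketch below applies this to the per-processor floating-point rates reported above for SP and LU; the function name is an illustrative choice, and the ratio is only a rough indicator because the report gives the extremes rather than the full per-processor distribution.

```python
def imbalance_ratio(max_rate, min_rate):
    """Ratio of the highest to the lowest per-processor rate (1.0 = balanced)."""
    return max_rate / min_rate

# Per-processor floating-point rates in units of 10^6/s, from the text.
sp_100 = imbalance_ratio(21.21, 18.81)   # SP, OpenMP, 100 processors
lu_64 = imbalance_ratio(29.63, 14.03)    # LU, OpenMP, 64 processors

print(f"SP on 100 processors: {sp_100:.2f}")
print(f"LU on 64 processors:  {lu_64:.2f}")
```

By this measure, the busiest LU processor performs roughly twice the floating-point work of the least busy one, whereas the SP spread is about 13%.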
3.2 Communication Cost Issues. The performance of the benchmark BT under MPI lagged
behind that under OpenMP (see Figure 3). To understand the issues involved, VAMPIR and
VAMPIRTRACE (parallel processing tools from Pallas GmbH) were run. The tools give a
breakdown of time spent for different tasks, including MPI, and also identify load imbalances.
Runs were made with 16, 36, and 64 processors. In the 64-processor run, 65% of the total
processing time was spent on MPI. The MPI runs also showed that load imbalances were present
(i.e., only 9 of the processors in the 64-processor case were actually 50% occupied by the
application, while in 17 of the processors, this figure was less than 25%). Improving MPI
processing is feasible but was not attempted here.

3.3  OpenMP  Implementation  of  Irregular  CFD  Codes.  Both  of  the  non-NAS 
benchmarks,  LES  and  IRREG,  were  parallelized  using  the  SPMD  style  of  OpenMP  that  relies 
heavily  on  domain  decomposition.  While  domain  decomposition  can  result  in  a  coarse-grain 
program  exhibiting  good  scalability,  it  does  transfer  the  responsibility  of  decomposition  from  the 
compiler  to  the  programmer.  Once  the  problem  domain  is  decomposed,  the  same  sequential 
algorithm  is  followed  but  is  modified  to  handle  the  multiple  subdomains.  The  program  is 
replicated  on  each  thread  but  has  different  extents  for  the  subdomains.  Also,  data  that  are  local 
to  a  subdomain  (not  shared  globally)  are  specified  as  private  or  thread  private.  Thread  private  is 
used  for  subdomain  data  that  need  file  scope  or  are  used  in  common  blocks.  Also,  message 
passing  is  replaced  by  shared  data  that  can  be  read  by  any  thread. 

For  LES,  initialization  of  the  data  is  parallelized  using  one  parallel  region  for  better  data 
locality  among  active  processors.  The  main  computational  kernel  is  embedded  in  the 
time-advancing  loop.  The  time  loop  is  treated  sequentially,  and  the  kernel  itself  is  parallelized 
using  another  parallel  region.  In  this  parallel  region,  the  32  x  32  x  32  mesh  is  blocked  in  the 
z-direction  and  each  block  is  tasked  to  a  different  processor. 
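
In the SPMD style described above, each thread derives its own subdomain extents from its thread number. The sketch below is a minimal illustration of blocking 32 planes in the z-direction, assuming contiguous slabs that differ in size by at most one plane; the function name and the half-open index convention are assumptions, not details of the LES code.

```python
def z_slab(thread_id, n_threads, nz=32):
    """Half-open range [lo, hi) of z-planes owned by one thread."""
    base, extra = divmod(nz, n_threads)
    lo = thread_id * base + min(thread_id, extra)
    hi = lo + base + (1 if thread_id < extra else 0)
    return lo, hi

# Eight threads on the 32 x 32 x 32 mesh: four z-planes each.
slabs = [z_slab(t, 8) for t in range(8)]
print(slabs)
```

Each thread would then run the same kernel over its own range of planes, reading any neighbouring planes it needs directly from shared data.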

The IRREG code contains a series of loops that iterate over nodes, edges, and faces. The
loops over edges and faces involve indirect accesses to memory locations, which are difficult to
analyze and parallelize in a loop-level sense. However, a parallel version of the code can be
obtained by partitioning the nodes among the processors. The edges and faces are assigned
to the processor that owns a majority of the corresponding nodes. The recursive coordinate
bisection (RCB) partitioner used in the code does not optimally minimize the number of cut
edges (communication effort) but does attempt to reduce the amount of communication and to
balance the computational work. The performance of LES and IRREG is shown in Figure 6. For
LES, a speedup of 5.1 is obtained on eight processors. In LES, the matrix solver, the most
expensive module, is made cache-friendly by optimizing it for single CPU efficiency. Inherent
data dependencies contribute to the imperfect scaling observed for eight processors. For IRREG,
the speedup was measured on up to 32 processors. Again, the parallel version scaled quite well,
with a factor of 30.0 on 32 processors. Initial attempts to parallelize IRREG using loop-level
parallelization yielded almost no speedup. Data and work distribution using specialized
partitioners were extremely important for the parallel performance of this code, which could
not be achieved through directives for loop-level parallelism.
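
The partitioning steps described above can be sketched in a few lines. The code below is a minimal illustration, not the partitioner used in IRREG: it assumes the number of parts is a power of two and no larger than the number of nodes, it cuts at the median along the longest axis, and it breaks ownership ties in favour of the lowest part number. All function names and the tie-breaking rule are assumptions.

```python
from collections import Counter

def rcb_partition(coords, n_parts):
    """Assign each node to a part by recursive coordinate bisection.

    coords: list of coordinate tuples; n_parts: a power of two (assumed).
    """
    owner = [0] * len(coords)
    n_dims = len(coords[0])

    def extent(node_ids, axis):
        values = [coords[i][axis] for i in node_ids]
        return max(values) - min(values)

    def split(node_ids, first_part, parts):
        if parts == 1:
            for i in node_ids:
                owner[i] = first_part
            return
        # Cut perpendicular to the longest axis, at the median node.
        axis = max(range(n_dims), key=lambda d: extent(node_ids, d))
        ordered = sorted(node_ids, key=lambda i: coords[i][axis])
        half = len(ordered) // 2
        split(ordered[:half], first_part, parts // 2)
        split(ordered[half:], first_part + parts // 2, parts // 2)

    split(list(range(len(coords))), 0, n_parts)
    return owner

def element_owner(element_nodes, owner):
    """Part owning the most nodes of an edge or face (ties: lowest part id)."""
    counts = Counter(owner[n] for n in element_nodes)
    most = max(counts.values())
    return min(part for part, c in counts.items() if c == most)

def cut_edges(edges, owner):
    """Number of edges whose two nodes lie in different parts."""
    return sum(1 for a, b in edges if owner[a] != owner[b])

# Eight nodes on a line, joined into a chain, split into four parts.
coords = [(float(x), 0.0) for x in range(8)]
edges = [(i, i + 1) for i in range(7)]
owner = rcb_partition(coords, 4)
edge_owner = [element_owner(e, owner) for e in edges]
print(owner, edge_owner, cut_edges(edges, owner))
```

The cut-edge count is a rough proxy for the communication effort mentioned above, since each cut edge needs data from a node owned by another processor.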


Figure 6. Performance of OpenMP Versions of LES and IRREG With the R10k Chip.
4.  Conclusions


In  this  report,  experiments  have  been  conducted  to  study  the  performance  achieved  through 
shared-memory  (OpenMP  implementations)  and  message-passing  (MPI  implementations) 
paradigms  for  six  benchmark  programs  with  realistic  problem  sizes  run  on  a  128-processor 
Origin 2000 with both the older R10k and the newer R12k chips. Moreover, hardware-profiling
data  were  analyzed  to  understand  the  reasons  for  imperfect  speedups  of  these  codes. 

These experiments lead to several interesting observations. Somewhat better performance
was obtained from the MPI programs than from the OpenMP programs. The main factor behind limited
scalability  of  the  OpenMP  versions  was  load  imbalance.  Only  the  outer  loops  were  parallelized, 
and,  on  large  configurations,  not  all  processors  could  be  kept  busy.  The  second  most  important 
performance  obstacle  for  OpenMP  versions  was  the  cost  of  remote  references.  False  sharing  and 
synchronization  costs  were  insignificant  for  the  programs  in  this  benchmark  set. 

The  MPI  versions  demonstrated  excellent  load  balance,  with  parallelism  obtained  through 
domain  decomposition.  The  main  factor  in  the  limited  scalability  of  MPI  versions  was 
communication costs. The MPI codes’ communication costs are indeed higher, as shown by the
VAMPIRTRACE data. Experience in developing parallel versions of the two irregular CFD
codes showed that the SPMD-style parallelization facility of OpenMP enabled easy and efficient
parallelization of these applications.

This  study  concluded  that  programmers  need  to  concentrate  on  achieving  good  work 
distribution  while  optimizing  the  performance  of  OpenMP  versions,  and  they  need  to  concentrate 
on  improving  communication  performance  while  optimizing  the  performance  of  MPI  versions. 
These  conclusions  are  applicable  only  to  the  programs  that  have  similar  features  to  the 
benchmark  programs  studied  here. 